80 research outputs found

    A Comparison of Feature-Based and Neural Scansion of Poetry

    Automatic analysis of poetic rhythm is a challenging task that involves linguistics, literature, and computer science. When the language to be analyzed is known, rule-based systems or data-driven methods can be used. In this paper, we analyze poetic rhythm in English and Spanish. We show that the representations of data learned from character-based neural models are more informative than the ones from hand-crafted features, and that a Bi-LSTM+CRF model produces state-of-the-art accuracy on scansion of poetry in two languages. Results also show that information about whole-word structure, and not just independent syllables, is highly informative for performing scansion. Comment: RANLP 201

    Semantikan oinarritutako bilaketak: Kyoto proiektua

    Semantic-based research: Kyoto Project. In the digital management of documentation, the use of the text itself can be very interesting, in addition to the descriptors. Many descriptors are also text. The use of linguistic engineering techniques opens up new options for accessing information from these databases: multilingual access, semantic grouping, access based on similarity, question-answer systems, information inference, etc. This paper looks in more detail at the possibilities based on semantics, setting out the research areas being developed by the authors as part of the European Kyoto project

    Strategies to develop Language Technologies for Less-Resourced Languages based on the case of Basque

    Over 23 years, the IXA group has developed a basic set of resources, tools and applications for Basque, following an initial strategy that has been adapted to technological changes. We think that our strategy and experience can be a reference for other less-resourced languages. According to a six-level classification of world languages, we estimate that this strategy may be useful for several hundred languages, those that have developed a written standard but are still beginners in Human Language Technology

    A Methodology to Measure the Diachronic Language Distance between Three Languages Based on Perplexity

    This is an Accepted Manuscript of an article published by Taylor & Francis in Journal of Quantitative Linguistics on 01 Mar 2020, available online: http://www.tandfonline.com/10.1080/09296174.2020.1732177 The aim of this paper is to apply a corpus-based methodology, based on the measure of perplexity, to automatically calculate the cross-lingual language distance between historical periods of three languages. The three historical corpora have been constructed and collected with the closest spelling to the original on a balanced basis of fiction and non-fiction. This methodology has been applied to measure the historical distance of Galician with respect to Portuguese and Spanish, from the Middle Ages to the end of the 20th century, both in original spelling and automatically transcribed spelling. The quantitative results are contrasted with hypotheses extracted from experts in historical linguistics. Results show that Galician and Portuguese are varieties of the same language in the Middle Ages and that Galician has converged with and diverged from Portuguese and Spanish since the last period of the 19th century. In this process, orthography plays a relevant role. It should be pointed out that the method is unsupervised and can be applied to other languages. This work has received financial support from DOMINO project [PGC2018-102041-B-I00, MCIU/AEI/FEDER, UE]; eRisk project [RTI2018-093336-B-C21]; the Consellería de Cultura, Educación e Ordenación Universitaria (accreditation 2016-2019, ED431G/08, Consolidation and structuring of Groups with Growth Potential: 745ED431B 2017/39) and the European Regional Development Fund (ERDF)

    A spelling corrector for basque based on morphology

    This paper describes the components used in the elaboration of the commercial Xuxen spelling checker/corrector for Basque. Because Basque is a highly inflected and agglutinative language, the spelling checker/corrector has been conceived as a by-product of a general-purpose morphological analyser/generator. The spelling checker/corrector performs morphological decomposition in order to check misspellings and, to correct them, uses a new strategy which combines the use of an additional two-level morphological subsystem for orthographic errors, and the recognition of correct morphemes inside the word-form during the generation of proposals for typographical errors. Due to a late process of standardization of Basque, Xuxen is intended as a useful tool for standardization purposes of present-day written Basque

    Teknologia garatzeko estrategiak baliabide urriko hizkuntzetarako: euskararen eta Ixa taldearen adibidea

    The article begins by presenting several figures that show the situation of the Basque language, and then proposes a classification of the world's languages according to their presence on the Internet and in language technology. The body of the article presents the work done by the Ixa group in the field of automatic processing of Basque, identifying its seven main milestones and describing the strategy that has guided this development. It argues that this strategy can serve as a reference for 190 languages that, according to the proposed classification, have no language technology resources but do have a minimal significant presence on the Internet

    TweetMT : a parallel microblog corpus

    We introduce TweetMT, a parallel corpus of tweets in four language pairs that combine five languages (Spanish from/to Basque, Catalan, Galician and Portuguese), all of which have an official status in the Iberian Peninsula. The corpus has been created by combining automatic collection and crowdsourcing approaches, and it is publicly available. It is intended for the development and testing of microtext machine translation systems. In this paper we describe the methodology followed to build the corpus, and present the results of the shared task in which it was tested

    Introducción a la tarea compartida Tweet-Norm 2013: Normalización léxica de tuits en español

    This article presents an introduction to the Tweet-Norm 2013 task: description, corpora, annotation, preprocessing, submitted systems and results obtained. Postprint (published version)

    Massively multilingual accessible audioguides via cell phones

    Bidaide is a web service that allows the visitors of a museum, route or building to read or listen to explanations about the visited place on their own mobile phone and in their own language. The visitor can access the explanations in various ways: by scanning QR codes located in the place, by GPS positioning (on outdoor routes), or by automatic Bluetooth proximity activation. This makes it accessible for people with reduced or no vision. The platform also offers the manager of the visited site the most advanced language resources to create the texts and audios of the explanations in many languages